feat: simulation suite runner (npm run sim) #18
Open
dhruva-reddy wants to merge 1 commit into dhruva-reddy/feat/validate-command from
Conversation
This was referenced May 1, 2026
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite.
adhamvapi approved these changes on May 2, 2026
dhruva-reddy added a commit that referenced this pull request on May 2, 2026
**Problem.** The Vapi API rejects bad configs at PATCH time with terse
400s ("property speed should not exist") — and by then the push has
already partially completed against other resources. We watched the
same five classes of mistake hit production over and over:
1. Assistant names (or eval names) longer than 40 chars (silent cap).
2. Structured-output ↔ assistant lockstep mismatch — one side declares
the relationship, the other doesn't, and the dashboard ends up inconsistent.
3. Prompts duplicated by paste-on-top dashboard edits (a 10 kB prompt
with two identical headers stacked; the agent follows both).
4. `maxTokens` set lower than the JSON-schema size of the attached
tools' arguments — assistant looks fine on push, bricks on first
tool-using call.
5. Voice fields nested wrong for the provider (`voice.speed` on
Cartesia, where it lives at `voice.generationConfig.speed`).
**What this fix does.** Five client-side validators, all running off
the same `LoadedResources` shape that `push.ts` would actually ship —
so the lint runs against exactly what would be pushed, no separate
parser to drift. Surfaces as warnings by default (one bad spec doesn't
block an otherwise-good push); promote to abort with `--strict`. Run
standalone via `npm run validate -- <org>`.
**Outcome you'll notice.** Most schema-class mistakes get caught
locally in seconds instead of mid-push 400s. Voice provider field
mismatch gets a specific message pointing at the right path. CI can
add `npm run push -- <env> --strict` as a gate before any deploy.
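A check like the one behind that voice message might look roughly like this sketch (only the `voice.speed` → `voice.generationConfig.speed` mapping is stated above; the table shape, provider keys, and message text are illustrative):

```typescript
// Hypothetical rule table: provider → misplaced top-level field → correct path.
// Only the Cartesia speed mapping is taken from the problem statement; other
// fields from rule 5 (stability, similarityBoost, ...) would be dropped instead.
const MISPLACED_FIELDS: Record<string, Record<string, string>> = {
  cartesia: { speed: "voice.generationConfig.speed" },
};

// Returns one message per field that is nested wrong for the given provider.
function checkVoiceFields(
  provider: string,
  voice: Record<string, unknown>,
): string[] {
  const rules = MISPLACED_FIELDS[provider] ?? {};
  return Object.keys(voice)
    .filter((field) => field in rules)
    .map((field) => `voice.${field} is invalid for ${provider}; use ${rules[field]}`);
}
```

The point is that the finding names the correct path, instead of the API's terse "property speed should not exist".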
---
Catch the classes of errors that today only surface when the API returns
a 400 mid-push. The push pipeline runs validation in warn-only mode by
default; --strict promotes errors to a blocking abort before any API
call. Standalone runner via `npm run validate -- <org>`.
Validators implemented:
1. Name length cap (40 chars). Walks every assistant.name and every
evaluations[].structuredOutput.name in scenarios. Closes #18.
2. SO ↔ assistant bidirectional lockstep. For every SO file's
assistant_ids, checks the named assistant's structuredOutputIds
mirrors it; reverse direction too. Closes #11.
3. Prompt duplication heuristics. Same H1 heading appearing twice,
repeated CONTINUITY ON ENTRY / CLOSEOUT FLOW STRUCTURE blocks.
Partial fix for #8 (paste-on-top dashboard duplications).
4. maxTokens floor for tool-using assistants. Computes
floor ≈ 25 + sum(len(JSON.stringify(tool.function.parameters)))
per attached tool. Warns under floor. Closes #19.
5. Per-provider voice schema. Cartesia rejects top-level speed /
stability / similarityBoost / enableSsmlParsing (point at
generationConfig.* / drop the field). 11labs rejects
generationConfig (it's a Cartesia path). Closes #9 (engine half).
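The floor from rule 4 can be sketched in a few lines (a minimal illustration: the `Tool` shape is a stand-in for the real attached-tool spec, and the constant 25 comes from the formula above):

```typescript
// Hypothetical minimal tool shape — stands in for the real attached-tool spec.
interface Tool {
  function: { name: string; parameters: unknown };
}

// Rule 4: floor ≈ 25 + the serialized size of each attached tool's parameter
// schema. An assistant whose maxTokens sits below this floor looks fine on
// push but can truncate its first tool-using call.
function maxTokensFloor(tools: Tool[]): number {
  return (
    25 +
    tools.reduce(
      (sum, t) => sum + JSON.stringify(t.function.parameters).length,
      0,
    )
  );
}

const tools: Tool[] = [
  { function: { name: "lookup", parameters: { type: "object", properties: {} } } },
];
const floor = maxTokensFloor(tools); // 25 + 33-char schema = 58
```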
- src/validate.ts (NEW): validateResources(loadedResources) returning
ValidationFinding[] with severity / type / resourceId / rule / message
/ fieldPath. Pure data; safe to test directly.
- src/validate-cmd.ts (NEW): CLI entry. Loads same resource shape as
push.ts so the lint runs against exactly what would ship. Exit non-zero
on any error finding.
- src/config.ts: --strict flag.
- src/push.ts: validators run in default-warn mode; --strict aborts.
- package.json: validate script.
- AGENTS.md: document npm run validate and --strict.
- tests/validate.test.ts: per-rule fixtures (golden + bad inputs)
covering all five checks.
Closes improvements.md #11, #18, #19. Resolves engine half of #9.
Partial #8, #20 (heuristic only).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
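As a sketch, the finding shape and the `--strict` gate described above might look like this (field names follow the `src/validate.ts` bullet; the sample values and the helper are illustrative):

```typescript
// Illustrative only — field names mirror the ValidationFinding bullet above.
interface ValidationFinding {
  severity: "warning" | "error";
  type: string;       // resource kind, e.g. "assistant"
  resourceId: string; // local name or state-file id
  rule: string;       // e.g. "name-length-cap"
  message: string;
  fieldPath: string;  // e.g. "voice.speed"
}

// Default mode prints findings and proceeds; --strict aborts on any error
// finding before the first API call.
function shouldAbort(findings: ValidationFinding[], strict: boolean): boolean {
  return strict && findings.some((f) => f.severity === "error");
}

const finding: ValidationFinding = {
  severity: "error",
  type: "assistant",
  resourceId: "support-bot",
  rule: "name-length-cap",
  message: "name exceeds 40 characters",
  fieldPath: "name",
};
```

Keeping the findings as pure data is what makes the per-rule fixtures in `tests/validate.test.ts` straightforward to assert against.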
## ELI5
**Problem.** The engine could *create* simulation suites and track
them in state, and AGENTS.md described `simulations/suites/` as a
first-class resource type. But there was no `npm run` command to
actually *execute* a suite. `npm run eval` exists but runs the
*legacy* `/evals` endpoint — a different thing — and the naming
overlap actively misled engineers into running the wrong command. To
fire a simulation suite from the CLI you had to write raw curl or go
to the dashboard UI (losing reproducibility).
**What this fix does.** Adds `npm run sim`. Two shapes:
```
npm run sim -- <org> --suite <name> --target <assistant-or-squad>
npm run sim -- <org> --simulations <n1>,<n2> --target <assistant>
```
Resolves local resource names → state-file UUIDs the same way
`npm run call` does, POSTs `/eval/simulation/run`, polls the run
status, prints a summary table (pass/fail per simulation, mean run
time, structured-output evals).
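The run-then-poll flow reads roughly like this sketch (the endpoint paths come from the description above; the base URL, auth header, request/response field names, and 5 s poll interval are all assumptions):

```typescript
const BASE = "https://api.vapi.ai"; // assumed base URL

// Terminal-state check kept pure so the loop below stays simple.
function isTerminal(status: string): boolean {
  return status === "completed" || status === "failed";
}

// POST the run, then poll /eval/simulation/run/:id until it finishes.
async function runSuite(suiteId: string, targetId: string, token: string) {
  const headers = {
    Authorization: `Bearer ${token}`,
    "Content-Type": "application/json",
  };
  const started = await fetch(`${BASE}/eval/simulation/run`, {
    method: "POST",
    headers,
    body: JSON.stringify({ suiteId, targetId }), // body shape is an assumption
  });
  const { id } = (await started.json()) as { id: string };
  for (;;) {
    const res = await fetch(`${BASE}/eval/simulation/run/${id}`, { headers });
    const run = (await res.json()) as { status: string };
    if (isTerminal(run.status)) return run; // summary table is built from this
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
}
```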
**Outcome you'll notice.** Simulation suites become a normal part of
the gitops workflow: author the suite as YAML, push it via
`npm run push`, run it via `npm run sim`. No more dashboard
clicking. An AGENTS.md call-out clarifies the difference between
`npm run sim` (unified `/eval/simulation/*`) and `npm run eval`
(legacy `/evals`); renaming `eval` to disambiguate is a separate,
backwards-incompatible follow-up.
---
Engine fully tracks simulation suites in state and AGENTS.md describes
simulations/suites/ as a first-class resource type, but there's no
npm run command to actually execute one. npm run eval runs the legacy
/evals endpoint, not the unified simulation runner. Customers go to
the dashboard UI to trigger runs (losing reproducibility) or write
per-customer shell wrappers.
- src/sim.ts (NEW): runSimulationSuite + runSimulationsByName helpers.
Resolves local-name → UUID via state file; POSTs /eval/simulation/run;
polls /eval/simulation/run/:id until completion; prints pass/fail
summary per simulation with mean run time + structured-output evals.
Reuses src/api.ts:vapiRequest for HTTP and the local-name → UUID
resolution pattern from src/eval.ts.
- src/sim-cmd.ts (NEW): CLI entry. Args:
npm run sim -- <org> --suite <name> --target <assistant-or-squad>
npm run sim -- <org> --simulations <n1>,<n2> --target <assistant>
npm run sim -- <org> --suite <name> --watch
- package.json: sim script.
- AGENTS.md: document npm run sim alongside npm run eval (call out the
legacy /evals vs unified /eval/simulation/* distinction).
- tests/sim.test.ts: arg parsing, UUID resolution, status polling,
summary table formatting.
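The local-name → UUID resolution those bullets reuse from `src/eval.ts` might be sketched like this (the state-file shape and error text are assumptions):

```typescript
// Assumed state-file shape: resource kind → local name → remote UUID.
type StateFile = Record<string, Record<string, string>>;

// Resolve a local resource name to the UUID the API expects, the same
// pattern npm run call uses; fail loudly when the resource isn't in state.
function resolveUuid(state: StateFile, kind: string, localName: string): string {
  const uuid = state[kind]?.[localName];
  if (!uuid) {
    throw new Error(`no ${kind} named "${localName}" in state — push it first`);
  }
  return uuid;
}

const state: StateFile = {
  suites: { "refund-flows": "0b9c2a7e-1111-2222-3333-444455556666" },
};
```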
Note: renaming npm run eval to disambiguate is a follow-up — that's a
backwards-incompatible script-name change. For now the AGENTS.md note
calls out the distinction.
Closes improvements.md #16.
🤖 Generated with [Claude Code](https://claude.com/claude-code)